Abstract

Current state-of-the-art systems for visual content analysis
require large training sets for each class of interest, and
performance degrades rapidly with fewer examples. In this paper,
we present a general framework for the zero-shot learning problem
of performing high-level event detection with no training
exemplars, using only textual descriptions. This task goes beyond
the traditional zero-shot framework of adapting a given set of
classes with training data to unseen classes. We leverage video
and image collections with free-form text descriptions from widely
available web sources to learn a large bank of concepts, in
addition to using several off-the-shelf concept detectors, speech,
and video text for representing videos. We utilize natural
language processing technologies to generate event description
features. The extracted features are then projected to a common
high-dimensional space using text expansion, and similarity is
computed in this space. We present extensive experimental results
on the large TRECVID MED [26] corpus to demonstrate our approach.
Our results show that the proposed concept detection methods
significantly outperform current attribute classifiers such as
Classemes [34], ObjectBank [21], and SUN attributes [28]. Further,
we find that fusion, both within as well as between modalities, is
crucial for optimal performance.

1.  Introduction 

Popular websites such as YouTube, Google Images, and Flickr
contain large volumes of image and video data from a multitude of
consumer devices such as digital and cell-phone cameras.
Technologies that can rapidly analyze such content and detect
salient concepts and events have several compelling applications.
Significant progress has been made in developing such
technologies, and the core of most state-of-the-art methods is
based on the bag-of-words model [?]. Here, we first extract
low-level features that capture salient gradient [22, 4], color
[35], or motion [?] patterns, project them to a pre-trained
codebook in the same feature space, and then aggregate the
projections to get the final image- or video-level feature vector.
Classifiers, typically kernel support vector machines (SVM), are
then trained using labeled data. This approach requires a large
number of training examples for each class of interest, and
performance decreases rapidly as the training set size decreases.

In this paper, we study the problem of video classification using
only a textual description of the events of interest, without
exemplar videos pertaining to the events. This zero-shot
framework, where we perform video classification with zero
training samples, goes beyond traditional zero-shot problems such
as described in [27], where an existing set of classes with
training data is adapted to an unseen class. We pose this
difficult problem of video classification as a retrieval task,
where an event is described as a query defined by a set of
concepts, e.g. the event "driving a car" described by the set of
concepts "drive, car, road, person, face." We aim to retrieve
videos that are most similar to the query, where the similarity
score is treated as the confidence of the video belonging to that
event.

Our approach to zero-shot learning is to first transform both
video and query text to a high-dimensional concept space before
computing similarity in that space. For the query, we apply text
processing techniques to obtain a vector of salient words and
phrases describing the event. For the video, we apply a bank of
concept detectors to obtain a textual representation of the video
using a vector of detection scores. Since we have no prior
knowledge of the events of interest, we need a very large set of
generic concept detectors in order to provide semantic coverage of
all possible queries. To address this challenge, we utilize
multiple concept detectors from different modalities: visual
features, including video concepts and multiple query fusion [?]
of multiple features described in this paper, in addition to
off-the-shelf detectors such as Classemes [34], ObjectBank [21]
and SUN attributes [28]; audio information from concepts learned
on low-level MFCC features; and text from video text and speech
transcriptions.

Once we represent both query and videos as vectors of concept
scores, we can compute similarities to retrieve relevant videos.
A key challenge here is the mismatch between query and video
concept vocabularies. We utilize a text expansion based method to
project query and video concept vectors to a common
high-dimensional concept space where they are compared, using the
large text corpus Gigaword [14] to learn this projection matrix.
Finally, we fuse retrievals from each of the features and
modalities using a simple linear combination to exploit the
complementary nature of the different modalities and concept
vocabularies.

The paper is organized as follows: in Section 2 we discuss related
approaches to similar problems. In Section 3 we present an
overview of our zero-shot learning framework. Section 4 describes
the features we extract from video and Section 5 outlines the
combination of these features. We report experimental results in
Section 6 and discuss our conclusions in Section 7.

2.  Related  Work 

Extensive research has been performed in recent years on effective
representation and classification of images and videos. The first
step in most techniques is to extract low-level features from
local spatial or spatio-temporal patches. Popular features include
grayscale appearance features such as SIFT [22] and SURF [4],
color features such as Color SIFT [35], and motion features such
as STIP [20] and dense trajectories [37]. These typically extract
thousands to millions of feature vectors per image or video. They
are aggregated to a single fixed-dimensional representation by a
sequence of coding and pooling steps. Possible coding techniques
include Hard Quantization [?], Soft Quantization [36], Sparse
Coding [?] and Fisher Vectors [32], using a codebook trained in an
unsupervised manner from a large set of feature vectors. The coded
features are then aggregated, typically using average or max
pooling, and classified, typically using support vector machines
(SVM).

While this approach has shown strong results given a large
training set, performance degrades rapidly as the amount of
training data decreases and the method does not generalize to
previously unseen events. Only limited attention has been paid to
this challenging problem and most existing approaches introduce an
intermediate layer of semantic concepts, which are then used to
describe novel classes. Semantic output codes (SOC) are proposed
in [27] to extrapolate novel classes by utilizing a knowledge base
of semantic properties of known classes. A large scale ontology is
used in [?] to learn visual relationships between objects, while
[30] uses knowledge transfer between object classes. An online
incremental attribute based zero-shot learning approach is
presented in [17], while a max-margin formulation is proposed in
[?] for zero-shot multi-label classification where the label
correlations on the training set differ significantly from the
test set. A constrained optimization formulation that combines
regression and knowledge transfer based functions has recently
been proposed in [12].

All of these techniques rely on extrapolating from an existing set
of classes and training data. The more difficult task of
performing video retrieval and classification with no prior event
knowledge or training data has been addressed only recently. In
contrast to [?], we introduce several ways to generate a large
visual and audio concept lexicon without prior knowledge of the
event classes, and present a simple unified framework for
effectively combining visual, audio, and textual information.
While we are not able to benchmark our method against [?] since we
do not have access to their concept lexicon or data partitions,
our results in the TRECVID evaluation (Section 6.6) compare
favorably to systems using similar approaches.

Video retrieval using semantic similarity has previously been
explored in [2, 16]. However, these approaches focus on highly
structured broadcast data, where a small 374-concept pool [2] can
be adequate. In contrast, we focus on more challenging
unconstrained web data where leveraging multiple modalities and
larger concept banks is important to build a robust system. While
[2, 16] both use a pre-defined concept ontology, we demonstrate
the benefit of training in-domain detectors in a data-driven
manner by discovering concepts from free-form text descriptions.

There has also been an increasing interest in joint modeling of
text and visual features [?], which can then potentially be used
to generate a text description of query images [13, ?, 38, 24] and
videos [18, 19]. A large scale study of the relationship between
semantic similarity of classes and confusion between them is
presented in [?]. In [?], a large text corpus is used to learn a
semantic space using word distributions and a separate model is
trained for seen and unseen classes. However, given the training
data limitations in our problem, we constrain our focus to
attribute mappings produced using off-the-shelf features
[34, 28, 21], novel concept banks developed with video-caption
pairs similar to [?], and speech and video text output.

3.  Zero-shot  Learning  Framework 

Figure 1. Overview of the proposed multi-modal zero-shot learning approach.

Figure 1 displays an overview of our multi-modal zero-shot
learning approach, which involves applying C different concept
banks on each video v. Let $L = \{c_{l1}, \ldots, c_{lK}\}$ be a
lexicon defined by K concepts $c_{lk}$, for concept bank
$l \in [1, \ldots, C]$. Each concept bank provides a K-dimensional
vector of detection scores $d_v = [d_{l1}, \ldots, d_{lK}]^T$ for
each video $v = 1 \ldots V$, that is $\ell_2$-normalized; i.e.,
$\|d_v\|_2 = 1$. Given a query $Q = \{c_{q1}, \ldots, c_{qN}\}$
defined by N concepts $c_{qn}$, we aim to retrieve videos that are
similar to the query.

3.1.  Basic  Similarity  Computation 

We  first  present  a  direct  model  to  measure  video-query 
similarity.  In  this  model,  we  compute  the  similarity  score 
Sq(v)  between  a  query  Q  and  a  video  v  as  a  sum  of  the 
concept  scores  of  the  lexicon  that  match  the  query  concepts: 

$$S_Q(v) = \sum_{k=1}^{K} d_{lk} \, \mathbb{1}_Q(c_{lk}) \qquad (1)$$

where $\mathbb{1}_Q(c_{lk})$ is an indicator function of the
presence of concept $c_{lk}$ in query Q.

This  baseline  system  is  very  precise  for  efficient  concept 
detectors.  We  expect  the  system  to  perform  well  when  there 
is  a  large  match  between  the  query  and  video  concepts,  but 
lexicon  coverage  of  the  query  will  limit  recall  while  noise 
in  the  video  concept  detections  will  degrade  precision. 
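As a rough illustration of the basic model, the following Python sketch reads Eq. (1) as a plain sum of the detection scores whose lexicon concept also appears in the query. The function name `basic_similarity`, the toy lexicon, and the score values are assumptions made for this example and do not come from the paper.

```python
def basic_similarity(detection_scores, lexicon, query_concepts):
    """Sum the detection scores of lexicon concepts that occur in the query."""
    query = set(query_concepts)
    return sum(score
               for concept, score in zip(lexicon, detection_scores)
               if concept in query)


# Hypothetical lexicon and detection scores for one video (not normalized).
lexicon = ["drive", "car", "road", "dog", "kitchen"]
video_scores = [0.6, 0.5, 0.4, 0.1, 0.0]
query = ["drive", "car", "road", "person", "face"]

score = basic_similarity(video_scores, lexicon, query)
print(score)
```

Note that the query concepts "person" and "face" contribute nothing here because they are absent from the toy lexicon, which is the recall limitation described above.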

3.2.  Expansion-based  Similarity  Computation 

To address the issue of vocabulary mismatch between query and
video, we use an alternative model to measure video-query
similarity. In this model, concepts are expanded and projected to
a common global concept space defined by the lexicon L. The goal
is to propagate existing confidence scores to semantically similar
concepts using the knowledge from a text corpus (like Gigaword) to
estimate similarity. Let $G : (c_1, c_2) \to s$ be a text model
that measures the similarity $s \in [0, 1]$ between two concepts
$c_1$ and $c_2$. Let an item I in the database be represented by a
set of triplets describing the concept name, its confidence score,
and its index in the lexicon L. The expansion-based projection
method is given in Algorithm 1.


Algorithm 1 Expansion-based projection.

Given an item $I = \{(c_1, s_1, i_1), \ldots, (c_N, s_N, i_N)\}$.
Let $f \in \mathbb{R}^K$ be the projected feature vector of item I for L.
Initialization: $f_k = 0$ for $k = 1 \ldots K$.
for each $(c, s, i)$ in I do
    $f_i \leftarrow f_i + s$
    Find the top T similar concepts of c in G, given as
        $I_T = \{(c_1, s_1, i_1), \ldots, (c_T, s_T, i_T)\}$,
        where $s_t = G(c, c_t)$.
    Update the feature for the similar concepts:
    for each $(c_t, s_t, i_t)$ in $I_T$ do
        $f_{i_t} \leftarrow f_{i_t} + s \cdot s_t$
    end for
end for
Normalize the feature vector so that $\|f\|_2 = 1$.


This  algorithm  obtains  the  projected  vector  f  of  an  item  I 
in  two  steps  for  each  concept.  The  first  step  finds  the  top  T 
similar  concepts  using  the  model  G.  The  second  step  boosts 
the  scores  of  the  similar  concepts  for  an  item  by  the  amount 
of  similarity  between  the  concepts.  The  final  feature  vector 
f  is  then  normalized  for  comparison  purposes. 
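The two steps of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the text model G is replaced by a hand-written dictionary `similar`, concepts are looked up by name in a hypothetical `lexicon_index` instead of carrying their index in a triplet, and concepts missing from the lexicon are simply skipped.

```python
import math


def expand_and_project(item, lexicon_index, similar, top_t=2):
    """Project (concept, score) pairs onto the lexicon with text expansion.

    lexicon_index: dict mapping concept name -> position in the lexicon
    similar: dict mapping concept -> list of (other_concept, similarity),
             a stand-in for the text model G (assumption of this sketch)
    """
    f = [0.0] * len(lexicon_index)
    for concept, score in item:
        # Direct contribution of the concept itself.
        if concept in lexicon_index:
            f[lexicon_index[concept]] += score
        # Boost the top-T most similar concepts by score * similarity.
        neighbours = sorted(similar.get(concept, []),
                            key=lambda pair: pair[1], reverse=True)[:top_t]
        for other, sim in neighbours:
            if other in lexicon_index:
                f[lexicon_index[other]] += score * sim
    norm = math.sqrt(sum(x * x for x in f))
    return [x / norm for x in f] if norm > 0 else f


lexicon_index = {"car": 0, "road": 1, "vehicle": 2, "dog": 3}
similar = {"car": [("vehicle", 0.8), ("road", 0.5), ("dog", 0.1)]}
f_query = expand_and_project([("car", 1.0)], lexicon_index, similar, top_t=2)
print(f_query)
```

With `top_t=2`, the query concept "car" spreads its confidence to "vehicle" and "road" but not to "dog", so a video whose detectors only fire on "vehicle" can still match the query.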

Algorithm 1 is applied to expand both the query and database
concepts to a common lexicon space. Query concept confidences are
given as 1, while database concept confidences are given by the
output of the concept detectors. Once the expanded feature vectors
$f_Q \in \mathbb{R}^K$ representing the query Q and
$f_v \in \mathbb{R}^K$ representing the video v have been
obtained, the similarity between the query Q and the video v is
computed as

$$S_Q(v) = f_Q^T f_v. \qquad (2)$$

Note that other similarity measures may also be considered (e.g.,
Laplacian or RBF kernels), although in our experiments we find
that (2) has the best performance.
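Retrieval with the inner-product similarity of Eq. (2) amounts to scoring every video vector against the query vector and sorting. The sketch below assumes the vectors are already expanded and normalized; the names `dot` and `rank_videos` and the toy vectors are illustrative only.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def rank_videos(f_query, video_vectors):
    """Return (video_id, score) pairs sorted by decreasing similarity."""
    scored = [(vid, dot(f_query, f_v)) for vid, f_v in video_vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical normalized vectors in a 3-concept space.
f_query = [0.6, 0.8, 0.0]
videos = {
    "v1": [1.0, 0.0, 0.0],
    "v2": [0.0, 1.0, 0.0],
    "v3": [0.0, 0.0, 1.0],
}
ranking = rank_videos(f_query, videos)
print(ranking)
```

Because both vectors have unit length, the inner product here equals the cosine similarity.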

4. Video Feature Extraction

Since existing concept banks are generally trained on
out-of-domain data and may not contain a large enough vocabulary
to cover possible queries, we propose multiple methods to rapidly
learn new concept detectors with easily collected data from
readily available in-domain and web sources.

4.1. Weakly Supervised Concepts (WSC)

We train a set of WSCs for concept detection in videos using the
following steps:

4.1.1 Data Collection and Concept Discovery

We collect a set of videos with free-form text descriptions of
their content. Such data is widely available online in websites
such as YouTube and also in the research set of the considered
TRECVID MED dataset. We apply standard natural language processing
(NLP) techniques to clean up the annotations, including removal of
common stop words and stemming to normalize word inflections. The
remaining vocabulary is taken as our concept dictionary.

4.1.2 Low-level feature extraction

For each video in the collected corpus, we extract the following
set of low-level visual and audio features:

D-SIFT [?]: This is a dense version of SIFT where, instead of
detecting interest points, the 128-dimensional feature vectors are
extracted at uniformly-sampled locations covering the whole image.
D-SIFT typically generates 3x the number of points produced by
SIFT [22] and has been shown to outperform SIFT for image
classification [?].

Dense Trajectories (DT) [37]: This feature represents the video
using dense optical flow trajectories. Histogram of oriented
gradients (HoG) and motion boundary histograms (MBH) are extracted
from the local spatio-temporal neighborhood of each track to
capture salient appearance and motion patterns respectively.

MFCC [?]: These popular audio features are extracted from
overlapping 29 ms frames at a rate of 100 frames per second. From
each frame, we compute 14 mel-frequency warped cepstral
coefficients. The resulting 45-dimensional feature vector captures
the short-time spectral structure of the audio stream.

For each of the above low-level features, we first apply principal
component analysis (PCA) to reduce the dimensionality and whiten
the feature vectors. For each video, we then obtain a set
$X = \{x_t \in \mathbb{R}^D, t = 1 \ldots T\}$ of T low-level
low-dimensionality feature descriptors. We assume that these
features are distributed according to a Gaussian mixture model
(GMM) with diagonal covariance matrix:

$$p(x_t \mid \lambda) = \sum_{k=1}^{K} w_k \, \mathcal{N}(x_t; \mu_k, \sigma_k^2), \quad \text{for } t = 1 \ldots T. \qquad (3)$$

The GMM parameters

$$\lambda = \{ w_k \in [0, 1], \; \mu_k \in \mathbb{R}^D, \; \sigma_k \in \mathbb{R}^D, \; k = 1 \ldots K \}$$

are learned on a training set through maximum likelihood
estimation. We then consider the Fisher vector encoding as
proposed in [29] and represent each video by the normalized
gradients of the GMM log-likelihood
$G^X_{\mu,k} \in \mathbb{R}^D$ and $G^X_{\sigma,k} \in \mathbb{R}^D$
with respect to the Gaussian mean $\mu_k$ and standard deviation
$\sigma_k$ parameters respectively. For $k = 1 \ldots K$, these
D-dimensional normalized gradients are defined as¹

$$G^X_{\mu,k} = \frac{1}{T \sqrt{w_k}} \sum_{t=1}^{T} \gamma_k(x_t \mid \lambda) \left( \frac{x_t - \mu_k}{\sigma_k} \right), \qquad (4)$$

$$G^X_{\sigma,k} = \frac{1}{T \sqrt{2 w_k}} \sum_{t=1}^{T} \gamma_k(x_t \mid \lambda) \left[ \frac{(x_t - \mu_k)^2}{\sigma_k^2} - 1 \right], \qquad (5)$$

where the posterior probability

$$\gamma_k(x_t \mid \lambda) = \frac{w_k \, \mathcal{N}(x_t; \mu_k, \sigma_k^2)}{\sum_{i=1}^{K} w_i \, \mathcal{N}(x_t; \mu_i, \sigma_i^2)}$$

is the soft assignment of the feature descriptor $x_t$ to the k-th
Gaussian cluster. The final Fisher vector is the concatenation of
the K D-dimensional normalized gradients $G^X_{\mu,k}$ and
$G^X_{\sigma,k}$, and is thus of dimension 2KD.
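The Fisher vector encoding described in this subsection can be sketched in plain Python as below. The sketch assumes the commonly used form of the normalized gradients (posterior-weighted, divided by T times the square root of the mixture weight), takes already-trained GMM parameters as input, and omits PCA and any further normalization. The function names and the toy parameters are hypothetical.

```python
import math


def gaussian_diag(x, mu, sigma):
    """Density at x of a Gaussian with diagonal covariance diag(sigma^2)."""
    log_p = 0.0
    for xd, md, sd in zip(x, mu, sigma):
        z = (xd - md) / sd
        log_p -= 0.5 * (math.log(2.0 * math.pi * sd * sd) + z * z)
    return math.exp(log_p)


def fisher_vector(X, weights, means, sigmas):
    """Concatenate the mean and standard-deviation gradients of all K Gaussians."""
    T, K, D = len(X), len(weights), len(X[0])
    g_mu = [[0.0] * D for _ in range(K)]
    g_sigma = [[0.0] * D for _ in range(K)]
    for x in X:
        likes = [w * gaussian_diag(x, m, s)
                 for w, m, s in zip(weights, means, sigmas)]
        total = sum(likes)  # assumed non-zero for this toy data
        for k in range(K):
            gamma = likes[k] / total  # soft assignment of x to Gaussian k
            for d in range(D):
                z = (x[d] - means[k][d]) / sigmas[k][d]  # element-wise
                g_mu[k][d] += gamma * z
                g_sigma[k][d] += gamma * (z * z - 1.0)
    fv = []
    for k in range(K):
        fv += [g / (T * math.sqrt(weights[k])) for g in g_mu[k]]
    for k in range(K):
        fv += [g / (T * math.sqrt(2.0 * weights[k])) for g in g_sigma[k]]
    return fv


# Toy example: T = 3 descriptors, D = 2 dimensions, K = 2 Gaussians.
X = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
weights = [0.5, 0.5]
means = [[0.0, 0.0], [4.0, 4.0]]
sigmas = [[1.0, 1.0], [1.0, 1.0]]
fv = fisher_vector(X, weights, means, sigmas)
print(len(fv))
```

The output length is 2KD, matching the dimension stated above. A practical implementation would work in the log domain to avoid underflow for descriptors far from every Gaussian.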

4.1.3  Classifier  Training 

For each concept identified in Section 4.1.1, we collect all
videos for which that concept occurs in the text caption, and
utilize them as our positive training set, with the remaining
videos considered as negatives. We then train RBF kernel-based
support vector machine (SVM) classifiers using the Fisher vectors
representing the videos. We train a set of concept detectors for
each of the low-level features (D-SIFT, DT, MFCC) described in
Section 4.1.2.
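The weak labeling used for classifier training (concept discovery from captions, then positives and negatives per concept) can be sketched as follows. This is only an illustration: the stop-word list is a tiny hypothetical one, `naive_stem` is a crude stand-in for a real stemmer, and the SVM training step itself is not shown.

```python
import re

# Tiny illustrative stop-word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "and", "is", "are", "with", "to"}


def naive_stem(word):
    """Very rough suffix stripping; a placeholder for a proper stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def caption_concepts(caption):
    """Lower-case, tokenize, drop stop words, and stem a caption."""
    tokens = re.findall(r"[a-z]+", caption.lower())
    return {naive_stem(t) for t in tokens if t not in STOP_WORDS}


def weak_labels(captions):
    """Map each discovered concept to (positive video ids, negative video ids)."""
    per_video = {vid: caption_concepts(text) for vid, text in captions.items()}
    vocabulary = set().union(*per_video.values())
    labels = {}
    for concept in vocabulary:
        pos = sorted(v for v, cs in per_video.items() if concept in cs)
        neg = sorted(v for v, cs in per_video.items() if concept not in cs)
        labels[concept] = (pos, neg)
    return labels


captions = {
    "v1": "A man is driving a car on the road",
    "v2": "Dogs playing in the park",
    "v3": "Cars parked in the street",
}
labels = weak_labels(captions)
print(labels["car"])
```

The labels are weak because a caption that omits a concept does not guarantee that the concept is absent from the video, so some negatives are in fact unlabeled positives.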

¹ Vector multiplications and divisions are element-wise operations here.

4.1.4 Weakly Supervised Concept Feature

Given a video, we produce a compact representation by
concatenating the detection scores of our concept detectors. We
use this feature vector for event detection and refer to this
representation as WSC, for weakly-supervised concepts.


4.2.  Concept  Training  using  Web  Data 


In addition to the concept detectors trained using the research
set described in Section 4.1.1, we also train detectors using data
downloaded from the web. For each concept identified in Section
4.1.1, we downloaded the top 100 retrievals from Google Images and
thumbnails for the top 50 retrievals from YouTube. We then train
WSCs with this data using the same approach as described in
Section 4.1. We call the WSCs trained using the TRECVID research
set, Google images and YouTube thumbnails WSC_TRECVID, WSC_Google
and WSC_YouTube respectively.


4.3.  Concept  Distance  Features 

We also introduce a novel concept distance (CD) based feature. Let
$\mathcal{C}$ denote the set of concepts identified from the text
annotations in Section 4.1.1. For each concept $c \in \mathcal{C}$,
let $V_c$ denote the set of videos in the research set containing
the concept. Let $x_i$ denote the low-level feature based vector
extracted for video i. Then, we compute the feature vector $y_c$
for the concept c as:

$$y_c = \frac{1}{|V_c|} \sum_{i \in V_c} x_i. \qquad (6)$$

Given a new video v and its low-level feature vector $x_v$, we
obtain the CD feature vector by computing the distance to each
$y_c$ in (6) and concatenating:

$$CD_v = \left[ \|x_v - y_1\|_2 \; \ldots \; \|x_v - y_{|\mathcal{C}|}\|_2 \right]^T. \qquad (7)$$

In our experiments, we use D-SIFT, DT and MFCC low-level feature
vectors. The proposed feature vector builds on the
multiple-queries (MQ) [?] and query expansion [?] based techniques
proposed previously. While these approaches identify relevant
videos at query time and use the retrievals to expand the concept
set or training set, we use a static set of concept vectors $y_c$
and compute distances at query time to these vectors.
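A minimal Python sketch of the concept distance feature follows, reading Eq. (6) as the mean feature vector of the videos annotated with a concept and Eq. (7) as the vector of Euclidean distances to those means. The function names and the 2-D toy features are assumptions for illustration; in the paper the inputs would be video-level D-SIFT, DT, or MFCC representations.

```python
import math


def concept_centroids(features, concept_videos):
    """Mean feature vector of the videos annotated with each concept."""
    centroids = {}
    for concept, video_ids in concept_videos.items():
        vectors = [features[v] for v in video_ids]
        dim = len(vectors[0])
        centroids[concept] = [
            sum(vec[d] for vec in vectors) / len(vectors) for d in range(dim)
        ]
    return centroids


def cd_feature(x, centroids, concept_order):
    """Euclidean distance from x to each concept centroid, in a fixed order."""
    return [math.dist(x, centroids[c]) for c in concept_order]


# Hypothetical low-level feature vectors for three research-set videos.
features = {"v1": [0.0, 0.0], "v2": [2.0, 0.0], "v3": [0.0, 4.0]}
concept_videos = {"car": ["v1", "v2"], "dog": ["v3"]}

centroids = concept_centroids(features, concept_videos)
cd = cd_feature([1.0, 0.0], centroids, ["car", "dog"])
print(cd)
```

Since the centroids are computed once, extracting the CD feature for a new video only needs one distance computation per concept, with no classifier evaluation.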

4.4.  Off-the-shelf  Concept  Detectors 

We  also  test  three  off-the-shelf  concept  detectors  that 
have  been  used  in  recent  literature: 

Classemes [34]: This is a bank of concept detectors trained on
images. These were chosen using a large ontology of visual
concepts. Given an image or a video frame, the application of all
these detectors yields a 2,659-dimensional vector of detection
scores.

ObjectBank [21]: Here, we use a spatial pyramid representation of
images and produce detection confidence scores at different scales
and spatial pyramids for each concept. The concept detectors are
trained using linear SVMs and an image is represented by
concatenating the detection scores of different concepts at
different scales and spatial pyramids.

SUN Attributes [28]: The SUN attribute set contains detectors for
102 scene attributes that were specified using crowd-sourced human
studies.

We apply each of these concept detectors on a set of frames
uniformly sampled from a video and then average the detection
scores across the video to get the final video-level feature
vector.
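The frame-to-video aggregation described above is a per-concept mean over the sampled frames. A small sketch, with a hypothetical function name and made-up scores for two concepts over three frames:

```python
def video_level_scores(frame_scores):
    """Average per-frame detection score vectors into one video-level vector."""
    n = len(frame_scores)
    return [sum(column) / n for column in zip(*frame_scores)]


# Hypothetical detector outputs: 3 sampled frames, 2 concepts each.
frames = [[0.2, 0.8], [0.4, 0.6], [0.6, 0.4]]
video_vector = video_level_scores(frames)
print(video_vector)
```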

4.5.  Automatic  Speech  Recognition  (ASR) 

We use GMM-based speech activity detection (SAD) and a hidden
Markov model (HMM) based multi-pass large vocabulary ASR to obtain
speech content in the video, and encode the hypotheses in the form
of word lattices.

We first extract MFCC features from the audio stream. Then, the
speech segments are identified by using a SAD system that employs
two GMMs, for speech and non-speech observations respectively. The
SAD model incorporates video clips with music content to enrich
the non-speech model, in order to handle the heterogeneous audio
in consumer video. Given the automatically detected speech
segments, we then apply a large-vocabulary ASR system to the
speech data to produce a transcript of the spoken content. The
system is adapted from an ASR system trained on English Broadcast
News, and updated with MED 2011 descriptor files [25], relevant
web text data, and the small set of annotated consumer video data.
We evaluated the ASR model on a held-out set of 100 video clips
and achieved a Word Error Rate (WER) of 35.8%. The system outputs
not only the 1-best transcripts but also word lattices with
acoustic and language model scores.

After basic processing to remove stop words and normalize word
inflections, the word lattice posteriors are used to generate the
concept score vectors used in the zero-shot projection system.

4.6.  Optical  Character  Recognition  (OCR) 

Our OCR system recognizes text in bounding boxes from a video text
detector using an HMM-based multi-pass large vocabulary OCR
system. Similar to our ASR system, word lattices are used to
encode alternative hypotheses. We leverage a statistically trained
video text detector based on SVM to estimate video text bounding
boxes.

Text candidate regions are first selected using Maximally Stable
Extremal Regions (MSER) and filtered using an SVM with rich shape
descriptors such as Histogram of Oriented Gradients (HoG), Gabor
filter, corners and geometrical features. Candidate regions are
then grouped to form word boundaries, and detected words are
binarized and filtered before being passed to the HMM-based OCR
system for recognition. The OCR system finds a sequence of
characters that maximizes the posterior, by using glyph models
(similar to the acoustic models in ASR), a dictionary and N-gram
language models. The word precision and recall of our system
measured on a small consumer video dataset are 14.7% and 37%
respectively.

Since the video text content presents itself in various forms,
such as subtitles, markup titles and in-scene text, it is much
more challenging than conventional scanned document OCR. To
address these challenges, we consider two versions of OCR: one
which utilizes the dictionary and N-gram language model, and one
which is character-based. While the language model corrects
character-level transcription errors, it also introduces errors
when falsely correcting out-of-vocabulary words. For the word
model OCR output, we generate a concept score vector from the word
lattice posteriors in the same way as ASR. For the character-based
model, we estimate word posteriors by smoothing character errors
across adjacent video frames to produce a concept score vector. In
our experiments we find the character model to be slightly better
for video than the word model, as detailed in Section 6.

5.  Fusion 

State-of-the-art systems for standard event detection with
training data have shown fusion of multiple features and
modalities to be crucial for improving performance [23]. Fusion is
especially important for the zero-shot problem, due to the sparse
occurrence of speech and video text content, as well as the
limited vocabulary intersection between a given concept bank and
query. While we do not have any training data on which to learn
parameters for more sophisticated fusion methods, we find that
simple score averaging works well to exploit the complementary
information in various systems. We further see some benefit to
manually increasing the weights of the higher precision ASR and
OCR systems in fusion, and use a linearly weighted score
combination for all fusion experiments below.
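A linearly weighted score combination of this kind can be sketched as below. The weights are purely illustrative (the paper does not list its values here), and treating a video with no score from a system as scoring 0 for that system is an assumption of this sketch, motivated by the sparse occurrence of speech and video text.

```python
def fuse_scores(system_scores, weights):
    """Weighted average of per-system video scores; missing scores count as 0."""
    total_weight = sum(weights[name] for name in system_scores)
    videos = set().union(*(scores.keys() for scores in system_scores.values()))
    return {
        vid: sum(weights[name] * scores.get(vid, 0.0)
                 for name, scores in system_scores.items()) / total_weight
        for vid in videos
    }


# Hypothetical per-system similarity scores for two videos.
systems = {
    "visual": {"v1": 0.2, "v2": 0.6},
    "asr": {"v1": 0.9},   # v2 has no recognized speech
    "ocr": {"v2": 0.5},   # v1 has no detected video text
}
weights = {"visual": 1.0, "asr": 2.0, "ocr": 2.0}  # illustrative values only
fused = fuse_scores(systems, weights)
print(fused)
```

Setting all weights to 1 reduces this to the simple score averaging mentioned above.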

6.  Experiments 

We test our approach on the large collection of consumer web
videos from the TRECVID MED 13 [26] dataset. The task is to
retrieve videos containing one of 20 diverse high-level multimedia
events, each described by a short text document of ~250 words. The
dataset provides a research set that contains ~12,000 background
videos and no exemplars of the events of interest. We use this
research set to learn our WSC_TRECVID and CD features. We report
on the designated MEDTest set containing ~25,000 videos. More
details of the events and data partitions may be found in [26].
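The results in this section are reported as mean average precision (MAP). As a reference for how such a number is obtained from ranked retrieval lists, here is a minimal sketch of the usual definition; the official TRECVID scoring tools may differ in details such as tie handling, and the event names and rankings below are made up.

```python
def average_precision(ranked_ids, relevant_ids):
    """Mean of the precision values at the rank of each relevant video."""
    relevant = set(relevant_ids)
    hits, precision_sum = 0, 0.0
    for rank, vid in enumerate(ranked_ids, start=1):
        if vid in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0


def mean_average_precision(rankings, ground_truth):
    """Average the per-event average precision over all events."""
    aps = [average_precision(rankings[e], ground_truth[e]) for e in rankings]
    return sum(aps) / len(aps)


# Hypothetical rankings for two events over four videos.
rankings = {"E1": ["a", "b", "c", "d"], "E2": ["d", "c", "b", "a"]}
ground_truth = {"E1": ["a", "c"], "E2": ["a"]}
map_score = mean_average_precision(rankings, ground_truth)
print(map_score)
```

Average precision rewards placing relevant videos near the top of the list, which is why sparse but precise sources such as speech and video text can score well on MAP while having modest AUC.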


6.1.  Comparison  of  Similarity  Computation 


Feature                   Basic (MAP)   Expanded (MAP)
ASR                       3.27%         3.66%
OCR (character)           4.43%         4.72%
CD_MFCC                   1.04%         1.04%
WSC^{D-SIFT}_{YouTube}    3.42%         3.48%

Table 1. Mean average precision (MAP) comparison between basic (1)
and expanded (2) query-video similarity computation for our single
best ASR, OCR, audio, and visual features.

Table 1 compares the two methods of query-video similarity
computation discussed in Section 3.1 and Section 3.2 for the best
feature in each modality. We observe that expansion improves over
or matches the simple approach for every feature. We observed
similar gains from using projection based features in fusion, and
thus we use the expansion-based approach in all experiments below.

6.2. Comparison of Visual Features


Feature                   MAP      AUC
SUN [28]                  0.48%    0.605
ObjectBank [21]           0.77%    0.592
Classemes [34]            0.84%    0.630
CD_D-SIFT                 1.71%    0.770
CD_DT                     2.28%    0.779
WSC^{D-SIFT}_{TRECVID}    1.92%    0.735
WSC^{DT}_{TRECVID}        2.76%    0.726
WSC^{D-SIFT}_{Google}     1.21%    0.543
WSC^{D-SIFT}_{YouTube}    3.48%    0.729

Table 2. Comparison of mean average precision (MAP) and area under
the curve (AUC) for visual features.

In these experiments, we compare our proposed WSC and CD features
to several off-the-shelf detectors. Table 2 summarizes our
results. Here, WSC^{D-SIFT}_{YouTube} refers to the weakly
supervised concept features trained using D-SIFT features
extracted on pre-downloaded YouTube thumbnails. Overall, the
WSC^{D-SIFT}_{YouTube} feature has the strongest performance in
terms of MAP, while the off-the-shelf detectors are significantly
weaker than our proposed approaches. A possible reason for this is
the large domain mismatch between the data used for training them
and the video data. The same issue could explain the weaker
performance of the WSC_Google features compared to WSC_TRECVID and
WSC_YouTube due to the domain mismatch between images and videos.
Moreover, the CD features that are significantly faster to extract
have comparable performance to the WSC features that require
training expensive SVMs. Finally, the WSC and CD features detected
using DT are stronger than the ones using D-SIFT.

6.3. Comparison of Audio Features

Feature        MAP      AUC
WSC_TRECVID    0.76%    0.507
CD_MFCC        1.04%    0.604

Table 3. Comparison of mean average precision (MAP) and area under
the curve (AUC) for audio features.

We compare the performance of our WSC and CD features
trained using the audio MFCC features. Table 3 summarizes
the MAP and AUC results. As observed, both of
the audio features are weaker than the visual features.

6.4.  Comparison  of  Language  Features 

Feature  MAP  AUC 

ASR  3.66%  0.583 

OCR  (word)  4.30%  0.636 

OCR  (character)  4.72%  0.611 

Table  4.  Comparison  of  mean  average  precision  (MAP)  and  area 
under  the  curve  (AUC)  for  language  features. 

Table 4 compares the performance of our OCR and ASR
systems. All the systems have higher MAP than the
visual and audio features from Tables 2 and 3. However,
note that the AUCs of many visual features exceed those of the
language features. This is because although language content,
when present, is a highly accurate source of information,
its occurrence is sporadic, leading to low recall.
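
The gap between MAP and AUC for sporadic features can be illustrated with a small sketch. The scores and labels below are invented toy data, not results from our experiments, and the sketch assumes that AUC denotes the area under the ROC curve; the function names `average_precision` and `roc_auc` are hypothetical helpers written for this illustration.

```python
def average_precision(scores, labels):
    """Average precision of the ranking obtained by sorting scores downward.

    Ties are broken pessimistically: among equal scores, negatives rank first.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: (-pair[0], pair[1]))
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def roc_auc(scores, labels):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tied pair counts as half a correct pair.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    correct = sum(1.0 if p > n else 0.5 if p == n else 0.0
                  for p in pos for n in neg)
    return correct / (len(pos) * len(neg))


# Toy collection (invented): 4 positive and 16 negative videos.
# "Sporadic" detector: only two positives contain usable language content,
# every other video receives a score of zero.
sporadic_labels = [1, 1, 1, 1] + [0] * 16
sporadic_scores = [1.0, 0.9, 0.0, 0.0] + [0.0] * 16

# "Dense" detector: every video gets a distinct score and the positives
# land at ranks 4 to 7.
dense_labels = [0, 0, 0, 1, 1, 1, 1] + [0] * 13
dense_scores = [float(20 - i) for i in range(20)]

for name, s, y in [("sporadic", sporadic_scores, sporadic_labels),
                   ("dense", dense_scores, dense_labels)]:
    print(f"{name}: AP={average_precision(s, y):.3f}  AUC={roc_auc(s, y):.3f}")
```

On this toy data the sporadic detector obtains the higher average precision (about 0.59 versus 0.43) but the lower AUC (0.75 versus about 0.81), because the positives it misses are tied with all the negatives at the bottom of the ranking.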

6.5.  Comparison  of  Fusion  Systems 


Feature  MAP  AUC

ASR  3.66%  0.583

OCR  5.87%  0.642

Audio  1.04%  0.623

Visual (CD + WSC)  6.12%  0.853

Full  12.65%  0.733

Table  5.  Comparison  of  mean  average  precision  (MAP)  and  area 
under  the  curve  (AUC)  for  fusion  systems. 

We fused the individual systems described above,
both within each modality and across modalities.
Table 5 compares the performance of the different fusion
systems. Within the visual system, we found
that off-the-shelf visual features did not improve the fused
system, so we included only our CD and WSC features. While
none of the individual visual features is stronger than ASR
or OCR, the visual system is the single strongest system
after fusion, gaining ~75% relative improvement in MAP over the
single best visual system. The combined OCR system also
outperforms the individual OCR systems, and the full system
that combines all modalities more than doubles the
performance of any individual modality as measured by MAP.
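
The two relative figures quoted above can be checked directly from the MAP columns of Tables 2 and 5. The following is only an arithmetic check; the variable names are chosen for illustration.

```python
# MAP values in percent, copied from Tables 2 and 5.
best_single_visual = 3.48  # WSC (D-SIFT, YouTube), best visual row of Table 2
modality_map = {"ASR": 3.66, "OCR": 5.87, "Audio": 1.04, "Visual": 6.12}
full_system = 12.65

# Relative gain of the fused visual system over the best single visual feature.
fused_visual = modality_map["Visual"]
relative_gain = (fused_visual - best_single_visual) / best_single_visual

# Ratio of the full system to the strongest individual modality.
best_modality = max(modality_map, key=modality_map.get)
ratio = full_system / modality_map[best_modality]

print(f"visual fusion gain: {relative_gain:.1%}")
print(f"full system vs. {best_modality}: {ratio:.2f}x")
```

The gain evaluates to roughly 76%, consistent with the ~75% stated above, and the full system reaches a little over twice the MAP of the fused visual system, the strongest single modality.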

6.6.  TRECVID  Performance 

The zero-shot event detection task was introduced as
a pilot training condition as part of the TRECVID MED
13 evaluations. Independent evaluations were conducted
by NIST on a blind dataset of ~100,000 videos, both for the
same 20 events as in our previous experiments (prespecified)
and for 10 new events given one week before the
evaluation (ad hoc). Our zero-shot system achieved highly
competitive scores for both prespecified and ad hoc conditions,
placing among the top three out of 9 submissions. In
particular, our consistent performance between prespecified
and ad hoc events demonstrates that our event-independent
approach generalizes robustly to new queries.

7.  Discussion  and  Conclusion 

Only limited attention has been devoted to the task of
video retrieval using only text queries. We present a systematic
evaluation of our zero-shot framework for performing
high-level multimedia event detection with no training data,
given only text descriptions of the events of interest. Our
findings and results on the large TRECVID MED dataset
can serve as an initial baseline for this challenging task.

We present a general framework for zero-shot learning
that utilizes multiple multi-modal features to map a video
to an intermediate semantic attribute space. These attributes
are then projected to a high-dimensional concept space using
statistics learned on a large text corpus. Similarity between the
attributes and a text query is computed in this space, and
the scores computed from different attribute sets are combined
to get the final score. We demonstrate the effectiveness
of this approach for aligning disjoint vocabularies between
the query and the various modalities.
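
The flow summarized above (attributes, expansion into a shared concept space, similarity, fusion) can be sketched as follows. This is a minimal illustration under stated assumptions, not our implementation: the relatedness table `RELATED`, its weights, the example videos, the cosine similarity, and the use of a plain average for fusion are all hypothetical stand-ins chosen to keep the example small.

```python
import math
from collections import defaultdict

# Hypothetical word-relatedness weights standing in for statistics learned
# on a large text corpus (all values invented for illustration).
RELATED = {
    "birthday": {"cake": 0.7, "candle": 0.5, "party": 0.6},
    "party": {"balloon": 0.4, "cake": 0.3},
    "cake": {"birthday": 0.7},
    "candle": {"birthday": 0.5},
    "car": {"road": 0.8},
}


def expand(weights):
    """Project a sparse {word: weight} vector into the expanded concept space."""
    out = defaultdict(float)
    for word, w in weights.items():
        out[word] += w
        for neighbor, rel in RELATED.get(word, {}).items():
            out[neighbor] += w * rel
    return dict(out)


def cosine(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def zero_shot_score(query, modalities):
    """Average (assumed fusion rule) of per-modality similarities to the query."""
    q = expand(query)
    per_modality = [cosine(expand(attrs), q) for attrs in modalities.values()]
    return sum(per_modality) / len(per_modality)


# Query words and detector outputs use different vocabularies on purpose.
query = {"birthday": 1.0, "party": 1.0}
birthday_video = {"visual": {"cake": 0.8, "candle": 0.6},
                  "asr": {"birthday": 1.0}}
car_video = {"visual": {"car": 0.9}, "asr": {}}

print("raw visual similarity:", cosine(birthday_video["visual"], query))
print("birthday video score:", round(zero_shot_score(query, birthday_video), 3))
print("car video score:", round(zero_shot_score(query, car_video), 3))
```

Before expansion, the visual attributes of the birthday video share no word with the query, so their raw similarity is zero. After expansion both are expressed over overlapping concepts, and the birthday video scores above the car video.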

We describe two simple but effective methods for rapidly
training new concept detectors using in-domain as well as
web data in the form of images/videos with associated text
descriptions. Detailed experimental results show that our
concept detectors significantly outperform off-the-shelf
detectors for zero-shot retrieval tasks. Exploiting the
complementary nature of speech and video text, as well as of
different concept banks, we perform multiple rounds of
fusion to produce a final system that is significantly better than
any individual feature or modality.